Implement Multithreading for Enhanced Performance in Custom Check Processing #284

rajeshpandey2053 · 2024-04-24T19:57:42Z

Description:

This pull request introduces a multithreading solution to enhance the performance of custom check processing in the codebase. The existing codebase comprises various rules applied to multiple fields. During execution, it became evident that parallel processing is necessary at two levels: the field level and the argument level.

At the field level, a check must be performed on multiple fields simultaneously. At the argument level, a check iterates over multiple arguments within a single field. Nested multithreading is therefore needed to get the full performance benefit.

By leveraging multithreading at both levels, we aim to parallelize the execution of checks, thereby significantly improving performance and efficiency, especially in scenarios involving a large number of checks or resolving URLs.
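The two levels described above can be sketched as an outer thread pool over fields, where each field task opens its own inner pool over its arguments. This is a minimal illustration and not the pyQuARC implementation: `check_argument`, `check_field`, `run_checks`, and the sample data are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def check_argument(arg):
    # Hypothetical per-argument check: an argument is valid when it is non-empty.
    return bool(arg)


def check_field(field_name, args):
    # Inner level: one task per argument of this field.
    with ThreadPoolExecutor(max_workers=4) as inner:
        results = list(inner.map(check_argument, args))
    return field_name, all(results)


def run_checks(fields):
    # Outer level: one task per field. Each task creates its own inner pool,
    # so outer workers never wait on tasks queued in their own pool.
    with ThreadPoolExecutor(max_workers=4) as outer:
        futures = [
            outer.submit(check_field, name, args) for name, args in fields.items()
        ]
        return dict(future.result() for future in futures)


# Hypothetical input: field name -> list of argument values.
fields = {"title": ["Sea Ice", "v2"], "doi": ["", "10.1000/x"]}
summary = run_checks(fields)
```

Threads help most when the checks are I/O-bound, such as the URL resolution mentioned above, because a thread waiting on the network lets other threads run.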

Changes:

Introduces the new functions _process_argument in custom_checker.py and _process_field in checker.py to split the work into units that can run in parallel.
Refactors the existing loop in the run function to execute custom checks in parallel using multithreading.
Improves readability and maintainability by separating concerns and encapsulating logic in reusable functions.
Adds documentation and comments that explain the multithreading implementation and its benefits.
Adds error handling for the multithreaded modules.
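As a rough illustration of error handling in threaded code, the sketch below collects exceptions per task instead of letting one failure abort the run. It relies on the fact that `Future.result()` re-raises any exception raised inside the worker. The names `process_field` and `run_all` and the failure condition are hypothetical, not taken from the PR.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_field(field):
    # Hypothetical worker: fails for one specific input to show error capture.
    if field == "bad":
        raise ValueError(f"cannot process {field!r}")
    return field.upper()


def run_all(fields):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to its input so errors can be attributed.
        future_to_field = {executor.submit(process_field, f): f for f in fields}
        for future in as_completed(future_to_field):
            field = future_to_field[future]
            try:
                results[field] = future.result()  # re-raises worker exceptions
            except Exception as exc:
                errors[field] = str(exc)
    return results, errors


results, errors = run_all(["title", "bad", "doi"])
```

Without the `future.result()` call, an exception raised in a worker would stay stored on the future and go unnoticed.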

Testing:

Extensive testing has been conducted to ensure the correctness and performance of the multithreading solution.
Integration tests have been performed to validate the code's functionality in various scenarios and edge cases.

Impact:

This change significantly improves the performance of custom check processing, especially in scenarios involving a large number of checks or resolving URLs.
Reduces average execution time from ~100s to ~10s.

xhagrg · 2024-04-25T15:40:46Z

pyQuARC/code/checker.py

+            for field_dict in list_of_fields_to_apply:
+                executor.submit(
+                    self._process_field,
+                    func,
+                    check,
+                    rule_id,
+                    metadata_content,
+                    field_dict,
+                    result_dict,
+                    rule_mapping,
+                )


Wouldn't running this in parallel lead to missing values? E.g.:
result_dict = {}

Running this method in parallel passes the same result_dict to every parallel call. When those calls update it concurrently, wouldn't result_dict end up missing some elements?

Not sure whether pass by value takes care of the issue. Proper testing (unit and manual) is required.

rajeshpandey2053 · 2024-04-30

Ran manual unit and integration tests with a curated list of concept IDs. The results from pyQuARC with and without multithreading are exactly the same.
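On the shared-dictionary question: Python passes the dictionary object itself by reference, so every worker receives the same result_dict rather than a copy. One common way to make compound updates safe is to guard them with a lock. The sketch below assumes a simplified worker and a rule/field layout invented for illustration; it does not reproduce the actual `_process_field` logic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

result_dict = {}
result_lock = threading.Lock()


def process_field(rule_id, field_name):
    # Hypothetical check result for one field.
    outcome = {"valid": len(field_name) > 0}
    # setdefault followed by item assignment is a multi-step update,
    # so it is guarded to keep concurrent writers from interleaving.
    with result_lock:
        result_dict.setdefault(rule_id, {})[field_name] = outcome


with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [
        executor.submit(process_field, "rule_a", f"field_{i}") for i in range(100)
    ]
    for future in futures:
        future.result()  # surfaces any exception raised in a worker
```

An alternative design is to have each worker return its result and merge everything in the submitting thread, which avoids shared mutable state altogether.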

xhagrg · 2024-04-25T15:42:11Z

pyQuARC/code/custom_checker.py

+        with ThreadPoolExecutor() as executor:
+            future_results = []
+            for arg in args:
+                future = executor.submit(
+                    self._process_argument,
+                    arg,
+                    func,
+                    relation,
+                    external_data,
+                    external_relation,
+                    invalid_values,
+                    validity,
+                )
+                future_results.append(future)
+
+            # Retrieve results from futures
+            for future in future_results:
+                invalid_values, validity = future.result()


Wouldn't sub-threading be an issue?

rajeshpandey2053 · 2024-04-30

Ran manual tests with a curated list of concept IDs and found no issues.
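One point worth checking in the hunk above is the retrieval loop: each iteration rebinds invalid_values and validity, so depending on how `_process_argument` treats the values passed in, only the last future's result may survive. A minimal sketch of accumulating per-argument results instead is shown below. The worker body, the function names, and the validity rule are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_argument(arg):
    # Hypothetical check: negative numbers are invalid.
    is_valid = arg >= 0
    return ([] if is_valid else [arg]), is_valid


def check_all(args):
    invalid_values = []
    validity = True
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(process_argument, arg) for arg in args]
        # Iterating in submission order keeps the output deterministic.
        for future in futures:
            arg_invalid, arg_valid = future.result()
            invalid_values.extend(arg_invalid)  # accumulate, do not overwrite
            validity = validity and arg_valid
    return invalid_values, validity


invalid, ok = check_all([3, -1, 7, -5])
```

Here each worker returns only its own findings, and the submitting thread combines them, so no state is shared between threads.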

rajeshpandey2053 added 3 commits April 24, 2024 08:31

Refactor _run_function method in checker.py for better code organization and readability

20741e8

use ThreadPoolExecutor for parallel processing in checker

c27bd69

Refactor CustomChecker class to use multithreading for argument processing

befddc5

rajeshpandey2053 requested review from xhagrg, slesaad and jenny-m-wood April 24, 2024 19:57

rajeshpandey2053 changed the title ~~Multithreading~~ Implement Multithreading for Enhanced Performance in Custom Check Processing Apr 24, 2024

rajeshpandey2053 self-assigned this Apr 24, 2024

xhagrg reviewed Apr 25, 2024

rajeshpandey2053 and others added 3 commits April 30, 2024 07:34

fix the appropriate handling of thread outputs in customer checker

eb421a4

add error handling and code organization in multi threaded code

b101caf

Add max_worker.

dfd681c

xhagrg merged commit a0fbe59 into dev May 6, 2024
1 check passed

xhagrg deleted the multithreading branch May 6, 2024 17:35

slesaad mentioned this pull request Jun 24, 2024

Release 1.2.7 #290

Merged
